PUBCRAWL: Protecting Users and Businesses from CRAWLers
Authors
Abstract
Web crawlers are automated tools that browse the web to retrieve and analyze information. Although crawlers are useful tools that help users find content on the web, they may also be malicious. Unfortunately, unauthorized (malicious) crawlers are increasingly becoming a threat to service providers because they typically collect information that attackers can abuse for spamming, phishing, or targeted attacks. In particular, social networking sites are frequent targets of malicious crawling, and there have been recent cases of scraped data being sold on the black market and used for blackmail. In this paper, we introduce PUBCRAWL, a novel approach for the detection and containment of crawlers. Our detection is based on the observation that crawler traffic significantly differs from user traffic, even when many users are hidden behind a single proxy. Moreover, we present the first technique for crawler campaign attribution that discovers synchronized traffic coming from multiple hosts. Finally, we introduce a containment strategy that leverages our detection results to efficiently block crawlers while minimizing the impact on legitimate users. Our experimental results on a large, well-known social networking site (receiving tens of millions of requests per day) demonstrate that PUBCRAWL can distinguish between crawlers and users with high accuracy. We have completed our technology transfer, and the social networking site is currently running PUBCRAWL in production.
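The abstract only states the intuition behind PUBCRAWL's campaign attribution (synchronized traffic coming from multiple hosts); the paper's actual features and models are not reproduced here. The following is a minimal, hypothetical Python sketch of that intuition: bucket each source host's requests into an hourly time series and flag host pairs whose series are strongly correlated. All names (hourly_counts, correlation, flag_synchronized_hosts), the 0.9 threshold, and the toy data are illustrative assumptions, not PUBCRAWL's implementation.

```python
# Illustrative sketch only: shows the *kind* of signal the abstract refers to,
# not the actual PUBCRAWL algorithm. All helper names are hypothetical.

from itertools import combinations
from statistics import mean, pstdev


def hourly_counts(timestamps, hours=24):
    """Bucket request timestamps (hour-of-day integers 0-23) into a 24-slot histogram."""
    counts = [0] * hours
    for t in timestamps:
        counts[t % hours] += 1
    return counts


def correlation(a, b):
    """Pearson correlation between two equal-length count series."""
    ma, mb = mean(a), mean(b)
    sa, sb = pstdev(a), pstdev(b)
    if sa == 0 or sb == 0:
        return 0.0
    cov = mean((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (sa * sb)


def flag_synchronized_hosts(traffic, threshold=0.9):
    """Return host pairs whose hourly request patterns are nearly identical,
    a possible hint of a coordinated (distributed) crawling campaign."""
    series = {host: hourly_counts(ts) for host, ts in traffic.items()}
    return [
        (h1, h2)
        for h1, h2 in combinations(series, 2)
        if correlation(series[h1], series[h2]) >= threshold
    ]


if __name__ == "__main__":
    # Toy traffic: two hosts requesting in lock-step vs. one bursty, human-like user.
    machine_schedule = [h for h in range(24) for _ in range(30 + (h % 6) * 10)]
    traffic = {
        "203.0.113.10": machine_schedule,                             # machine-like
        "203.0.113.11": list(machine_schedule),                       # same schedule
        "198.51.100.7": [9, 9, 10, 12, 13, 13, 18, 19, 20, 21, 21],   # human-like
    }
    print(flag_synchronized_hosts(traffic))  # -> [('203.0.113.10', '203.0.113.11')]
```

In practice, a system like PUBCRAWL operates on real server logs and far richer traffic features; this sketch merely illustrates why synchronized, machine-like request patterns stand out from organic user traffic.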
Similar resources
Online Social Honeynets: Trapping Web Crawlers in OSN
Web crawlers are complex applications that explore the Web for different purposes. Web crawlers can be configured to crawl online social networks (OSNs) to obtain relevant data about their global structure. Before a web crawler can be launched to explore the web, a large number of settings have to be configured. These settings define the behavior of the crawler and have a big impact on the collect...
Web Crawler: Extracting the Web Data
Internet usage has increased a lot in recent times. Users can find their resources by using different hypertext links. This usage of the Internet has led to the invention of web crawlers. Web crawlers are full-text search engines which assist users in navigating the web. These web crawlers can also be used in further research activities. For example, the crawled data can be used to find missing links, ...
Defining Evaluation Methodologies for Topical Crawlers
Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks through well-defined performance measures. We have argued that general evaluation methodologies a...
Evolutionary pattern, operation mechanism and policy orientation of low carbon economy development
The essence of low-carbon economy development is a continuous process of evolution and innovation of the socio-economic system, from a traditional high-carbon economy to a new, sustainable, green low-carbon economy, so as to achieve a sustainable dynamic balance and benign interactive development among society, the economy, and the natural ecosystem. At the current stage, China's socio-economy is showi...
Don't Tread on Me: Moderating Access to OSN Data with SpikeStrip
Online social networks rely on their valuable data stores to attract users and produce income. Their survival depends on the ability to protect users' profiles and to disseminate them to other users through controlled channels. Given the sparse user adoption of privacy policies, however, there is increasing incentive and opportunity for malicious parties to extract these datasets for profit using au...
Publication date: 2012